 multiple gpus


NOMAD Projection

arXiv.org Artificial Intelligence

The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept up with the pace of dataset scaling. This presents a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this paper, we introduce Negative Or Mean Affinity Discrimination (NOMAD) Projection, the first method for unstructured data visualization via nonlinear dimensionality reduction that can run on multiple GPUs at train time. We provide theory that situates NOMAD Projection as an approximate upper bound on the InfoNC-t-SNE loss, and empirical results that demonstrate NOMAD Projection's superior performance and speed profile compared to existing state-of-the-art methods. We demonstrate the scalability of NOMAD Projection by computing the first complete data map of Multilingual Wikipedia. CVPR 2025 Tutorial - Identifying Structure in Data: All you need to know about Dimensionality Reduction, Clustering, and More. 1. Introduction: The discovery of neural scaling laws has resulted in an explosion in the size of datasets consumed and produced by AI models [11] [9]. Traditional algorithms for unstructured data visualization, such as t-SNE [14] and UMAP [15], have not kept up with the pace of dataset scaling. This presents a significant challenge for data-centric AI explainability, since it relies upon methods like t-SNE and UMAP for exploratory data analysis.


EBIC: an open source software for high-dimensional and big data biclustering analyses

arXiv.org Machine Learning

Motivation: In this paper we present the latest release of EBIC, a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is adding support for big data, making it possible to efficiently run large genomic data mining analyses. Additional enhancements include integration with R and Bioconductor and an option to remove the influence of missing values on the final result. Results: EBIC was applied to datasets of different sizes, including a large DNA methylation dataset with 436,444 rows. For the largest dataset we observed an over 6.6-fold speedup in computation time on a cluster of 8 GPUs compared to running the method on a single GPU. This demonstrates the high scalability of the algorithm. Availability: The latest version of EBIC can be downloaded from http://github.com/EpistasisLab/ebic. Installation and usage instructions are also available online.
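
To put the quoted speedup in context, the short sketch below turns it into two standard scaling figures: parallel efficiency (speedup divided by the number of GPUs) and the Karp-Flatt serial fraction. The function names are illustrative and are not part of EBIC. The only inputs are the two numbers in the abstract, and since the abstract says "over 6.6", the efficiency printed here is a lower bound and the serial fraction an upper bound.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / workers


def karp_flatt_serial_fraction(speedup: float, workers: int) -> float:
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / workers) / (1.0 - 1.0 / workers)


# Figures quoted in the abstract: 6.6x speedup on 8 GPUs versus 1 GPU.
efficiency = parallel_efficiency(6.6, 8)
serial_fraction = karp_flatt_serial_fraction(6.6, 8)

print(f"parallel efficiency: {efficiency:.3f}")    # 0.825
print(f"serial fraction:     {serial_fraction:.3f}")  # 0.030
```

Read this way, the reported result corresponds to at least 82.5% of ideal linear scaling on 8 GPUs.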


LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs

arXiv.org Artificial Intelligence

Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPU. However, determining the most effective method for achieving rapid fine-tuning while preventing GPU out-of-memory issues in a given environment remains unclear. To address this challenge, we introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods across multiple GPUs and identifies the optimal method. We conduct GPU memory usage estimation prior to fine-tuning, leveraging the fundamental structure of transformer-based decoder models and the memory usage distribution of each method. Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.


JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning

arXiv.org Artificial Intelligence

The scaling of Large Language Models (LLMs) for retrieval-based tasks, particularly in Retrieval Augmented Generation (RAG), faces significant memory constraints, especially when fine-tuning extensive prompt sequences. Current open-source libraries support full-model inference and fine-tuning across multiple GPUs but fall short of accommodating the efficient parameter distribution required for retrieved context. Addressing this gap, we introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training. Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management, thereby enabling accelerated fine-tuning with reduced memory requirements. This advancement significantly improves the scalability and feasibility of fine-tuning LLMs for complex RAG applications, even on systems with limited GPU resources. Our experiments show more than 12x improvement in runtime compared to Hugging Face/DeepSpeed implementation with four GPUs while consuming less than half the VRAM per GPU.


Computron: Serving Distributed Deep Learning Models with Model Parallel Swapping

arXiv.org Artificial Intelligence

Many of the most performant deep learning models today in fields like language and image understanding are fine-tuned models that contain billions of parameters. In anticipation of workloads that involve serving many such large models to handle different tasks, we develop Computron, a system that uses memory swapping to serve multiple distributed models on a shared GPU cluster. Computron implements a model parallel swapping design that takes advantage of the aggregate CPU-GPU link bandwidth of a cluster to speed up model parameter transfers. This design makes swapping large models feasible and can improve resource utilization. We demonstrate that Computron successfully parallelizes model swapping on multiple GPUs, and we test it on randomized workloads to show how it can tolerate real-world variability factors like burstiness and skewed request rates. Computron's source code is available at https://github.com/dlzou/computron.


Understanding Memory Requirements for Deep Learning and Machine Learning

#artificialintelligence

Building a machine learning workstation can be difficult, not to mention choosing the right workstation with the proper machine learning memory requirements. There are a lot of moving parts based on the types of projects you plan to run. Understanding machine learning memory requirements is a critical part of the building process. Sometimes, though, it is easy to overlook. The average memory requirement is 16GB of RAM, but some applications require more memory.


9 libraries for parallel & distributed training/inference of deep learning models

#artificialintelligence

In this blog we will cover a few basics of large model training before jumping to the list of libraries available. Large deep learning models require a significant amount of memory to train. Models require memory to store intermediate activations, weights, etc. while training. Some models can be trained only with a very small batch size on a single GPU, while other models may not fit on a single GPU.
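
As a rough illustration of why some models do not fit on one GPU, the sketch below estimates the memory needed for model state alone. It assumes plain fp32 training with an Adam-style optimizer that keeps two extra values per parameter. These are assumptions chosen for the example, not figures from the blog post, and the function name is hypothetical. Activations, which depend on batch size and architecture, are deliberately left out.

```python
def training_state_bytes(num_params: int,
                         bytes_per_value: int = 4,
                         optimizer_states_per_param: int = 2) -> int:
    """Bytes for weights, gradients, and optimizer state (no activations).

    Assumes every value is stored at the same precision (fp32 by default)
    and that the optimizer keeps `optimizer_states_per_param` extra values
    per parameter (2 for Adam-style optimizers).
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer = num_params * bytes_per_value * optimizer_states_per_param
    return weights + gradients + optimizer


GIB = 1024 ** 3

# A hypothetical model with one billion parameters.
one_billion = training_state_bytes(1_000_000_000)
print(f"{one_billion / GIB:.1f} GiB before activations")  # 14.9 GiB
```

Under these assumptions a one-billion-parameter model already needs about 15 GiB before any activations are stored, which is why batch size is often the first thing to shrink on a single GPU.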


Fundamentals of Deep Learning for Multi-GPUs (Day 2)

#artificialintelligence

Note: By registering for Day 1 you will automatically be registered for Day 2. You cannot register for Day 2 separately. This page is a placeholder. This workshop teaches you techniques for training deep neural networks on multiple GPUs to shorten the training time required for data-intensive applications. Working with deep learning tools, frameworks, and workflows to perform neural network training, you'll learn concepts for implementing multi-GPU training in PyTorch to reduce the complexity of writing efficient distributed software and to maintain accuracy when training a model across many GPUs. Workshop format: Interactive presentation with hands-on exercises. Target audience: This workshop is intended for researchers who would like to use multiple GPUs to train deep learning models in PyTorch. Knowledge prerequisites: Participants should be comfortable with training deep learning models using a single GPU.


A Frequency-aware Software Cache for Large Recommendation System Embeddings

arXiv.org Artificial Intelligence

Deep learning recommendation models (DLRMs) have been widely applied in Internet companies. The embedding tables of DLRMs are too large to fit entirely in GPU memory. We propose a GPU-based software cache approach that dynamically manages the embedding table across CPU and GPU memory by leveraging the ID frequency statistics of the target dataset. Our proposed software cache is efficient in training entire DLRMs on GPU in a synchronized update manner. It also scales to multiple GPUs in combination with the widely used hybrid parallel training approaches. Evaluating our prototype system shows that we can keep only 1.5% of the embedding parameters in the GPU to obtain a decent end-to-end training speed.
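
The intuition behind a frequency-aware cache can be shown with a toy example: when accesses are heavily skewed, keeping only the most frequently used embedding rows in fast memory still serves most lookups. The sketch below is a simplified static ranking on a made-up access trace, not the paper's system. The function names and the trace are hypothetical, and there is no dynamic eviction or actual GPU memory involved.

```python
from collections import Counter


def build_hot_set(id_trace, capacity):
    """Return the `capacity` most frequently accessed IDs in the trace."""
    counts = Counter(id_trace)
    return {row_id for row_id, _ in counts.most_common(capacity)}


def hit_rate(id_trace, hot_set):
    """Fraction of accesses that would be served from the hot set."""
    hits = sum(1 for row_id in id_trace if row_id in hot_set)
    return hits / len(id_trace)


# Made-up skewed trace: 12 distinct IDs, two of which dominate the accesses.
trace = [0] * 60 + [1] * 30 + list(range(2, 12))

hot = build_hot_set(trace, capacity=2)  # cache 2 of the 12 rows
print(sorted(hot))           # [0, 1]
print(hit_rate(trace, hot))  # 0.9
```

Here caching 2 of 12 rows covers 90% of the accesses. How small the cached fraction can be in practice depends entirely on how skewed the real ID distribution is.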


GitHub - royorel/StyleSDF

#artificialintelligence

Training files will be released soon. StyleSDF is trained only on single-view RGB data. The 3D geometry is learned implicitly with an SDF-based volume renderer. We introduce a high resolution, 3D-consistent image and shape generation technique which we call StyleSDF. Our method is trained on single-view RGB data only, and stands on the shoulders of StyleGAN2 for image generation, while solving two main challenges in 3D-aware GANs: 1) high-resolution, view-consistent generation of the RGB images, and 2) detailed 3D shape.